Skip to content

test(pt_expt): tolerance for test_default_fallback (CUDA scatter nondeterminism)#5508

Merged
njzjz merged 1 commit into
deepmodeling:masterfrom
wanghan-iapcm:fix-default-fallback-dpa3-virial
Jun 9, 2026
Merged

test(pt_expt): tolerance for test_default_fallback (CUDA scatter nondeterminism)#5508
njzjz merged 1 commit into
deepmodeling:masterfrom
wanghan-iapcm:fix-default-fallback-dpa3-virial

Conversation

@wanghan-iapcm

@wanghan-iapcm wanghan-iapcm commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

Summary

test_default_fallback (added in #5491) compared neighbor_list=None against an explicit DefaultNeighborList() with exact equality (assert_array_equal). The two are the identical builder — call_common does builder = nl if nl is not None else DefaultNeighborList() — so on CPU the two forward passes are bit-identical. But the test runs two independent forward passes, and on CUDA the dpa3 GNN message-passing scatter (atomic adds) is not bit-reproducible run-to-run. The virial differed by ~1 ULP (abs 3.46e-18, rel 5.64e-16), failing the exact comparison intermittently:

FAILED test_default_fallback[dpa3] - AssertionError: Arrays are not equal
dpa3 virial — Mismatched elements: 1 / 9 (11.1%)
Max absolute difference: 3.47e-18 ; Max relative difference: 5.64e-16

Fix

Switch the comparison to assert_allclose(rtol=1e-10, atol=1e-12) — far above the fp noise floor (~1e-16 rel) but orders of magnitude tighter than any real dispatch divergence (e.g. accidentally using a different builder) would produce. Verified on CPU that None and DefaultNeighborList() are bit-identical (max|Δ| = 0), confirming the residual is CUDA atomic nondeterminism, not a dispatch bug.

Known limitations

  • Validated on CPU (8/8 test_default_fallback pass). The failing CUDA path could not be re-validated locally (no GPU available this session); the CI CUDA job is the confirmation. The tolerance is ~6 orders of magnitude above the observed noise, so it is robust.
  • Relaxes the previously-exact comparison; this is the intended change. The sibling test_pt_expt_equivalence tests already use a 1e-9 tolerance.

Summary by CodeRabbit

Release Notes

  • Tests
    • Enhanced test robustness by implementing floating-point tolerance comparisons instead of strict equality checks, ensuring reliable testing across different hardware configurations.

…eterminism)

test_default_fallback compared neighbor_list=None against an explicit
DefaultNeighborList() with exact equality (assert_array_equal). The two are the
identical builder, so on CPU the results are bit-identical, but the test runs
two independent forward passes and on CUDA the dpa3 GNN message-passing scatter
(atomic adds) is not bit-reproducible run-to-run -- the virial differed by ~1
ULP (abs 3.5e-18, rel 5.6e-16), failing the exact comparison intermittently.

Switch to assert_allclose with a tight tolerance (rtol=1e-10, atol=1e-12), far
above the fp noise floor but orders of magnitude tighter than any real dispatch
divergence would be.
@github-actions github-actions Bot added the Python label Jun 9, 2026
@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: c87ddf32-513f-42a1-bdfb-d09c4f277d96

📥 Commits

Reviewing files that changed from the base of the PR and between e05af21 and 723a936.

📒 Files selected for processing (1)
  • source/tests/pt_expt/utils/test_neighbor_list.py

📝 Walkthrough

Walkthrough

Test utility updates the test_default_fallback function to accept ULP-level numerical variance in CUDA scatter/atomic operations by switching from strict equality to tolerance-based assertions (assert_allclose), with clarifying docstring explaining the dispatch behavior and tolerance rationale.

Changes

Neighbor List Fallback Test Tolerance

Layer / File(s) Summary
Test fallback dispatch and tolerance adjustments
source/tests/pt_expt/utils/test_neighbor_list.py
test_default_fallback docstring clarifies that neighbor_list=None dispatches to DefaultNeighborList() and documents expected ULP-level CUDA scatter/atomic variance. Assertions for energy, force, and virial outputs switch from strict equality to np.testing.assert_allclose with rtol=1e-10 and atol=1e-12.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Suggested labels

Python

Suggested reviewers

  • iProzd
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically summarizes the main change: updating the test to use tolerances instead of exact equality due to CUDA nondeterminism in scatter operations. It directly reflects the primary modification in the changeset.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@wanghan-iapcm wanghan-iapcm requested a review from njzjz June 9, 2026 04:05
@codecov

codecov Bot commented Jun 9, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 81.42%. Comparing base (e05af21) to head (723a936).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5508      +/-   ##
==========================================
- Coverage   81.42%   81.42%   -0.01%     
==========================================
  Files         871      871              
  Lines       96951    96951              
  Branches     4241     4243       +2     
==========================================
- Hits        78941    78938       -3     
- Misses      16708    16710       +2     
- Partials     1302     1303       +1     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@njzjz njzjz added this pull request to the merge queue Jun 9, 2026
Merged via the queue into deepmodeling:master with commit 5f848ad Jun 9, 2026
69 of 70 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants